US MASS SHOOTING DATA ANALYSIS¶

Prepared by: Kulwinder Kaur

In this notebook, I am working on US Mass Shooting data that has been taken from Kaggle. The data is from 1966 to 2017 comprising of upto 50 years worth of data.

The US has witnessed 398 mass shootings in last 50 years that resulted in 1,996 deaths and 2,488 injured. The latest and the worst mass shooting of October 2, 2017 killed 58 and injured 515 so far. The number of people injured in this attack is more than the number of people injured in all mass shootings of 2015 and 2016 combined.

Dataset: The dataset contains detailed information of 398 mass shootings in the United States of America that killed 1996 and injured 2488 people.

Variables: The dataset contains Serial No, Title, Location, Date, Summary, Fatalities, Injured, Total Victims, Mental Health Issue, Race, Gender, and Lat-Long information.

Goal: The purpose of the notebook is to clean the data, prepare a dash app to see the trend, understand geographical impact, impact in terms of lost lives, and shooter's health condition and demographic information. Also, I am aiming to prepare different kind of plotly charts to have a broader view of the data.

In [2]:
#importing the necessary libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
In [3]:
#reading the data from csv file
df=pd.read_csv('Mass Shootings Dataset Ver 5.csv',encoding='latin-1', index_col='S#')
In [4]:
#checking the head
df.head()
Out[4]:
Title Location Date Incident Area Open/Close Location Target Cause Summary Fatalities Injured Total victims Policeman Killed Age Employeed (Y/N) Employed at Mental Health Issues Race Gender Latitude Longitude
S#
1 Texas church mass shooting Sutherland Springs, TX 11/5/2017 Church Close random unknown Devin Patrick Kelley, 26, an ex-air force offi... 26 20 46 0.0 26 NaN NaN No White M NaN NaN
2 Walmart shooting in suburban Denver Thornton, CO 11/1/2017 Wal-Mart Open random unknown Scott Allen Ostrem, 47, walked into a Walmart ... 3 0 3 0.0 47 NaN NaN No White M NaN NaN
3 Edgewood businees park shooting Edgewood, MD 10/18/2017 Remodeling Store Close coworkers unknown Radee Labeeb Prince, 37, fatally shot three pe... 3 3 6 0.0 37 NaN Advance Granite Store No Black M NaN NaN
4 Las Vegas Strip mass shooting Las Vegas, NV 10/1/2017 Las Vegas Strip Concert outside Mandala Bay Open random unknown Stephen Craig Paddock, opened fire from the 32... 59 527 585 1.0 64 NaN NaN Unclear White M 36.181271 -115.134132
5 San Francisco UPS shooting San Francisco, CA 6/14/2017 UPS facility Close coworkers NaN Jimmy Lam, 38, fatally shot three coworkers an... 3 2 5 0.0 38 1.0 NaN Yes Asian M NaN NaN
In [5]:
#checking the shape of the dataframe
df.shape
Out[5]:
(323, 20)
In [6]:
#printing out rows and columns
print(f" There are {df.shape[0]} data rows with {df.shape[1]} features." )
 There are 323 data rows with 20 features.
In [7]:
#checking the info
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 323 entries, 1 to 323
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Title                 323 non-null    object 
 1   Location              278 non-null    object 
 2   Date                  323 non-null    object 
 3   Incident Area         261 non-null    object 
 4   Open/Close Location   295 non-null    object 
 5   Target                318 non-null    object 
 6   Cause                 246 non-null    object 
 7   Summary               323 non-null    object 
 8   Fatalities            323 non-null    int64  
 9   Injured               323 non-null    int64  
 10  Total victims         323 non-null    int64  
 11  Policeman Killed      317 non-null    float64
 12  Age                   179 non-null    object 
 13  Employeed (Y/N)       67 non-null     float64
 14  Employed at           41 non-null     object 
 15  Mental Health Issues  323 non-null    object 
 16  Race                  321 non-null    object 
 17  Gender                323 non-null    object 
 18  Latitude              303 non-null    float64
 19  Longitude             303 non-null    float64
dtypes: float64(4), int64(3), object(13)
memory usage: 53.0+ KB
In [8]:
#separating out numerical and categorical columns
df_numeric=df.select_dtypes(include='number')
df_categorical=df.select_dtypes(exclude='number')
In [9]:
#summary statistics of the numerical dataset
df_numeric.describe()
Out[9]:
Fatalities Injured Total victims Policeman Killed Employeed (Y/N) Latitude Longitude
count 323.000000 323.000000 323.000000 317.000000 67.000000 303.000000 303.000000
mean 4.436533 6.176471 10.263158 0.129338 0.626866 37.225076 -94.429539
std 5.783208 29.889182 33.662309 0.610294 0.487288 5.536365 16.513296
min 0.000000 0.000000 3.000000 0.000000 0.000000 21.325512 -161.792752
25% 1.000000 1.000000 4.000000 0.000000 0.000000 33.571459 -110.205485
50% 3.000000 3.000000 5.000000 0.000000 1.000000 36.443290 -88.122998
75% 5.500000 5.000000 9.000000 0.000000 1.000000 41.483844 -81.703237
max 59.000000 527.000000 585.000000 5.000000 1.000000 60.790539 -69.707823

Observations:

  • There is maximum number of 59 fatalities with 5 policemen killed at mass shooting. However maximum number of victims recorder is 585 which is a huge number.
  • Some shootings involved no fatalities.
  • 75% of the mass shootings had the suspect employed.
  • It appears that Total victims column is the sum of Fatalities and Injured columns.
In [10]:
#taking a look at the categorical columns
df_categorical.sample(5)
Out[10]:
Title Location Date Incident Area Open/Close Location Target Cause Summary Age Employed at Mental Health Issues Race Gender
S#
79 Residence in Lakeland, Florida NaN 1/6/2016 Home Close drug dealer NaN Two men arrived at the house of a drug dealer ... NaN NaN Unknown Black American or African American Male
52 Excel Industries mass shooting Hesston, Kansas 2/25/2016 workplace Close random NaN Cedric L. Ford, who worked as a painter at a m... NaN manufacturing company Unclear Black M
33 Plantation, Florida Plantation, Florida 3/19/2016 Home Close party guests NaN A Spring Break party in a house packed with 40... NaN NaN Unknown Unknown Unknown
123 Daytona Beach Daytona Beach, Florida 4/3/2015 Home Close random NaN On April 3, 2015, four people were shot early ... NaN NaN No Black American or African American Male
61 Kalamazoo Uber Driver Spree Killing NaN 2/20/2016 NaN Open random NaN An Uber driver fired upon people in three loca... NaN uber No White American or European American Male

Observations:

  • There categorical columns include Title, location, Incident Area, Target, Summary, Age, Employed at, Mental health issue, Race and Gender.
  • The date column can be converted into datetime format.
In [11]:
#checking null values
df.isna().mean()*100
Out[11]:
Title                    0.000000
Location                13.931889
Date                     0.000000
Incident Area           19.195046
Open/Close Location      8.668731
Target                   1.547988
Cause                   23.839009
Summary                  0.000000
Fatalities               0.000000
Injured                  0.000000
Total victims            0.000000
Policeman Killed         1.857585
Age                     44.582043
Employeed (Y/N)         79.256966
Employed at             87.306502
Mental Health Issues     0.000000
Race                     0.619195
Gender                   0.000000
Latitude                 6.191950
Longitude                6.191950
dtype: float64

Employed at column has ~90% of the vlaues missing, followed by Employed(Y/N) with ~80% and Age with ~44% values missing.

Data Wrangling¶

Latitude and Longitude columns¶

The latitude and longitude column both seem to have missing same number of values at a time since both have missing value percentage as 6.2%. We also have Location column that can be used to fill in the missing values. We can google the coordinates of the corresponding Location and use that to fill in the missing latitude and longitude vlaues.

In [12]:
#cross-checking the rows with missing coordinates
df[df['Latitude'].isnull()][['Location', 'Incident Area', 'Latitude','Longitude']]
Out[12]:
Location Incident Area Latitude Longitude
S#
1 Sutherland Springs, TX Church NaN NaN
2 Thornton, CO Wal-Mart NaN NaN
3 Edgewood, MD Remodeling Store NaN NaN
5 San Francisco, CA UPS facility NaN NaN
6 Tunkhannock, PA Weis grocery NaN NaN
7 Orlando, Florida manufacturer Fiamma Inc. NaN NaN
8 Kirkersville, Ohio a nursing home NaN NaN
9 Fresno, California a street in downtown NaN NaN
10 Fort Lauderdale, Florida baggage claim area of the airport NaN NaN
11 Burlington, WA cosmetics section of a Macy’s department store NaN NaN
12 Baton Rouge, LA NaN NaN NaN
13 Dallas, Texas at protest NaN NaN
14 Orlando, Florida at nightclub NaN NaN
52 Hesston, Kansas workplace NaN NaN
58 Kalamazoo County, Michigan restaurant parking NaN NaN
81 San Bernardino, California Christmas Party NaN NaN
83 Colorado Springs, Colorado health clinic NaN NaN
109 Menasha, Wisconsin NaN NaN NaN
155 Santa Barbara, California NaN NaN NaN
157 Fort Hood, Texas Fort Hood Army Post NaN NaN
In [13]:
#latitude and longitude values from google
nan_coordinates= {1: [29.2732, -98.0564],
2: [39.903057, -104.954445],
3: [39.418719, -76.294402],
5: [37.828724, -122.355537],
6: [41.5386878, -75.94658800000002],
7: [28.5383355, -81.3792365],
8: [39.95886, -82.5965 ],
9: [36.746841, -119.772591],
10: [26.0702, -80.1517],
11: [48.4756621, -122.3254375],
12: [30.471165, -91.147385],
13: [32.779167,  -96.808891],
14: [28.538336, -81.379234],
52: [38.138344, -97.431427],
58: [-85.591736, 42.299152],
81: [34.115784, -117.302399],
83: [38.859055, -104.813499],
109: [44.2022, -88.4465],
155: [34.420830, -119.698189],
157: [31.195001, -97.741314]}
In [14]:
#filling missing values
for item in nan_coordinates.keys():
    df.loc[item, 'Latitude']=nan_coordinates[item][0]
    df.loc[item, 'Longitude']=nan_coordinates[item][1]

Location Column¶

In [15]:
#looking at missing location and corresponding title, lat and lon columns
df[df['Location'].isnull()][['Location', 'Title', 'Latitude','Longitude']]
Out[15]:
Location Title Latitude Longitude
S#
16 NaN Forestville, Maryland Drive-by 38.845113 -76.874972
17 NaN Halifax County, VA 36.765971 -78.928344
18 NaN Tire-Slashing revenge escalation 39.290385 -76.612189
19 NaN Chicago Rap video Shootout 41.878114 -87.629798
20 NaN Texas family murder-suicide 29.785785 -95.824396
21 NaN Alabama highway random shooting 34.162040 -86.475543
22 NaN Long Beach Street murder 33.770050 -118.193739
23 NaN Albuquerque, NM House party shooting 35.085334 -106.605553
24 NaN Memphis, TN gas station shooting 35.149534 -90.048980
25 NaN Chicago Birthday Party Bus Shooting 41.878114 -87.629798
26 NaN Albuquerque, NM Family restaurant shooting 35.085334 -106.605553
30 NaN Greenhill, AL Family murder-suicide 34.759257 -86.602493
35 NaN Atlanta Nightclub shooting 33.748995 -84.387982
36 NaN Florida Family celebration ends in shooting 26.640628 -81.872308
37 NaN Elberton, Georgia murder-suicide 34.111223 -82.867084
38 NaN Trenton, NJ sidewalk shooting 40.217053 -74.742938
39 NaN Detroit Drive-by 42.331427 -83.045754
41 NaN Wilkinsburg BBQ cookout shootout 40.441736 -79.881994
43 NaN Lafayette, LA drive by 30.224090 -92.019843
44 NaN Kansas City Home Invasion 39.114053 -94.627464
46 NaN Roswell, GA Hookah shootout 34.023243 -84.361556
47 NaN Wichita Nightclub shooting 37.687176 -97.330053
48 NaN Detroit Strip Club Shootout 42.331427 -83.045754
49 NaN Riverside restaurant shooting 33.997197 -117.485480
54 NaN Belfair, WA Family murder-suicide 47.451459 -122.826946
56 NaN Missouri Highway shooting 38.771440 -90.370949
57 NaN Houston Drive-by 29.760427 -95.369803
60 NaN Florida Hookah Bar Shooting 27.950575 -82.457178
61 NaN Kalamazoo Uber Driver Spree Killing 42.291707 -85.587229
62 NaN Vallejo, CA car shooting 38.104086 -122.256637
63 NaN Michigan school shooting 43.201126 -86.238946
65 NaN Nightclub fight turns into shootout 43.161030 -77.610922
66 NaN Tampa, FL Strip club shooting 27.950575 -82.457178
67 NaN Los Angeles, CA street shooting 34.052234 -118.243685
68 NaN Texas teen commits family murder suicide 29.209684 -99.786168
69 NaN New Orleans gas station shoot-out 29.951066 -90.071532
70 NaN Glendale, AZ House Party shooting 33.538652 -112.185987
71 NaN Caroline County, VA Family Murder Suicide 38.044556 -77.286488
73 NaN Perris, CA Mexican restaurant shooting 33.782519 -117.228648
74 NaN Florida family murder 30.754181 -86.572799
75 NaN Los Angeles Drive-by 34.052234 -118.243685
76 NaN Gloucester County, VA, House Party 37.402640 -76.458559
77 NaN Wilmington, DE robbery 39.739072 -75.539788
78 NaN Roadside in Memphis, Tennessee 35.149534 -90.048980
79 NaN Residence in Lakeland, Florida 28.039465 -81.949804

We can see that the Title column has the information about the city and state. We can split the column value by comma and separarte out the state part.

In [16]:
#splitting title with space just to see the values that are returned
df[df['Location'].isnull()]['Title'].str.split(' ')
Out[16]:
S#
16                   [Forestville,, Maryland, Drive-by]
17                               [Halifax, County,, VA]
18                 [Tire-Slashing, revenge, escalation]
19                      [Chicago, Rap, video, Shootout]
20                      [Texas, family, murder-suicide]
21                 [Alabama, highway, random, shooting]
22                        [Long, Beach, Street, murder]
23           [Albuquerque,, NM, House, party, shooting]
24               [Memphis,, TN, gas, station, shooting]
25            [Chicago, Birthday, Party, Bus, Shooting]
26     [Albuquerque,, NM, Family, restaurant, shooting]
30             [Greenhill,, AL, Family, murder-suicide]
35                       [Atlanta, Nightclub, shooting]
36    [Florida, Family, celebration, ends, in, shoot...
37                 [Elberton,, Georgia, murder-suicide]
38                   [Trenton,, NJ, sidewalk, shooting]
39                                  [Detroit, Drive-by]
41                [Wilkinsburg, BBQ, cookout, shootout]
43                          [Lafayette,, LA, drive, by]
44                       [Kansas, City, Home, Invasion]
46                     [Roswell,, GA, Hookah, shootout]
47                       [Wichita, Nightclub, shooting]
48                     [Detroit, Strip, Club, Shootout]
49                    [Riverside, restaurant, shooting]
54               [Belfair,, WA, Family, murder-suicide]
56                        [Missouri, Highway, shooting]
57                                  [Houston, Drive-by]
60                     [Florida, Hookah, Bar, Shooting]
61            [Kalamazoo, Uber, Driver, Spree, Killing]
62                        [Vallejo,, CA, car, shooting]
63                         [Michigan, school, shooting]
65            [Nightclub, fight, turns, into, shootout]
66                  [Tampa,, FL, Strip, club, shooting]
67                [Los, Angeles,, CA, street, shooting]
68      [Texas, teen, commits, family, murder, suicide]
69              [New, Orleans, gas, station, shoot-out]
70              [Glendale,, AZ, House, Party, shooting]
71     [Caroline, County,, VA, Family, Murder, Suicide]
73         [Perris,, CA, Mexican, restaurant, shooting]
74                            [Florida, family, murder]
75                             [Los, Angeles, Drive-by]
76             [Gloucester, County,, VA,, House, Party]
77                           [Wilmington,, DE, robbery]
78                  [Roadside, in, Memphis,, Tennessee]
79                  [Residence, in, Lakeland,, Florida]
Name: Title, dtype: object
In [17]:
#we can filter out the splitted title column in separate dataframe.
# splitting the string by comma and keeping one entry before and after comma 
# so that we can get the city and state values.
cf=df[df['Location'].isnull()]['Title'].apply(lambda x:  x.split(',')[0].split(' ')[-1] + ', '+ x.split(',')[1].split(' ')[1] if ',' in x else x.split(' ')[:2])
cf
Out[17]:
S#
16       Forestville, Maryland
17                  County, VA
18    [Tire-Slashing, revenge]
19              [Chicago, Rap]
20             [Texas, family]
21          [Alabama, highway]
22               [Long, Beach]
23             Albuquerque, NM
24                 Memphis, TN
25         [Chicago, Birthday]
26             Albuquerque, NM
30               Greenhill, AL
35        [Atlanta, Nightclub]
36           [Florida, Family]
37           Elberton, Georgia
38                 Trenton, NJ
39         [Detroit, Drive-by]
41          [Wilkinsburg, BBQ]
43               Lafayette, LA
44              [Kansas, City]
46                 Roswell, GA
47        [Wichita, Nightclub]
48            [Detroit, Strip]
49     [Riverside, restaurant]
54                 Belfair, WA
56         [Missouri, Highway]
57         [Houston, Drive-by]
60           [Florida, Hookah]
61           [Kalamazoo, Uber]
62                 Vallejo, CA
63          [Michigan, school]
65          [Nightclub, fight]
66                   Tampa, FL
67                 Angeles, CA
68               [Texas, teen]
69              [New, Orleans]
70                Glendale, AZ
71                  County, VA
73                  Perris, CA
74           [Florida, family]
75              [Los, Angeles]
76                  County, VA
77              Wilmington, DE
78          Memphis, Tennessee
79           Lakeland, Florida
Name: Title, dtype: object

It appears that some entries are received correctly with city and state values however some row values are returned as list where there was no comma in between string.

  • We can replace the item after comma with state abbreviations.
  • Later we can just keep the state abbreviations and remove the city part. By doing so we will atleast have all the entries with state names.
  • Finally we will replace the state abbrevaitions by full name so that it can be used for analysis.
In [18]:
df.iloc[48,:]
Out[18]:
Title                                       Riverside restaurant shooting
Location                                                              NaN
Date                                                            2/28/2016
Incident Area                                          restaurant parking
Open/Close Location                                                  Open
Target                                                             random
Cause                                                                 NaN
Summary                 Shooter fired upon group of people in a restau...
Fatalities                                                              1
Injured                                                                 3
Total victims                                                           4
Policeman Killed                                                      0.0
Age                                                                   NaN
Employeed (Y/N)                                                       NaN
Employed at                                                           NaN
Mental Health Issues                                              Unknown
Race                                                      Some other race
Gender                                                               Male
Latitude                                                        33.997197
Longitude                                                      -117.48548
Name: 49, dtype: object
In [19]:
#by visually checking the second element in cf title items and determining the state name 
#replacing the second element with the state abbreviation
#index 18 and index 49 have been filled with state where the coodinated show up in google
fill_loc={18: 'IL',
19: 'IL',
20: 'TX',
21: 'AL',
22: 'CA',
25: 'IL',
35: 'GA',
36: 'FL',
39: 'MI',
41: 'PA',
44: 'MO',
47: 'KS',
48: 'MI',
49: 'CA' ,
56: 'MO',
57: 'TX',
60: 'FL',
61: 'MI',
63: 'MI',
68: 'TX',
69: 'LA',
74: 'FL',
75: 'CA'}
In [20]:
#filling second item in cf by state abbreviations which were entered manually
for i in fill_loc.keys():
    cf[i][1]=fill_loc[i]
    cf[i]=', '.join(cf[i])
In [21]:
#filled Cf location
cf
Out[21]:
S#
16    Forestville, Maryland
17               County, VA
18        Tire-Slashing, IL
19              Chicago, IL
20                Texas, TX
21              Alabama, AL
22                 Long, CA
23          Albuquerque, NM
24              Memphis, TN
25              Chicago, IL
26          Albuquerque, NM
30            Greenhill, AL
35              Atlanta, GA
36              Florida, FL
37        Elberton, Georgia
38              Trenton, NJ
39              Detroit, MI
41          Wilkinsburg, PA
43            Lafayette, LA
44               Kansas, MO
46              Roswell, GA
47              Wichita, KS
48              Detroit, MI
49            Riverside, CA
54              Belfair, WA
56             Missouri, MO
57              Houston, TX
60              Florida, FL
61            Kalamazoo, MI
62              Vallejo, CA
63             Michigan, MI
65       [Nightclub, fight]
66                Tampa, FL
67              Angeles, CA
68                Texas, TX
69                  New, LA
70             Glendale, AZ
71               County, VA
73               Perris, CA
74              Florida, FL
75                  Los, CA
76               County, VA
77           Wilmington, DE
78       Memphis, Tennessee
79        Lakeland, Florida
Name: Title, dtype: object
In [22]:
#splitting the state part as separate column
cf= cf.str.split(pat=',', n=1, expand=True)
cf
Out[22]:
0 1
S#
16 Forestville Maryland
17 County VA
18 Tire-Slashing IL
19 Chicago IL
20 Texas TX
21 Alabama AL
22 Long CA
23 Albuquerque NM
24 Memphis TN
25 Chicago IL
26 Albuquerque NM
30 Greenhill AL
35 Atlanta GA
36 Florida FL
37 Elberton Georgia
38 Trenton NJ
39 Detroit MI
41 Wilkinsburg PA
43 Lafayette LA
44 Kansas MO
46 Roswell GA
47 Wichita KS
48 Detroit MI
49 Riverside CA
54 Belfair WA
56 Missouri MO
57 Houston TX
60 Florida FL
61 Kalamazoo MI
62 Vallejo CA
63 Michigan MI
65 NaN NaN
66 Tampa FL
67 Angeles CA
68 Texas TX
69 New LA
70 Glendale AZ
71 County VA
73 Perris CA
74 Florida FL
75 Los CA
76 County VA
77 Wilmington DE
78 Memphis Tennessee
79 Lakeland Florida

We can separate out city and state from the location column for extra analysis.

In [23]:
#joining the state part of the location to actual df
df=df.join(df['Location'].str.split(pat=',',n= 1,expand=True).rename(columns={0: 'City', 1:'State'}))
In [24]:
#replacing state names with the state names in cleaned CF state column
df.loc[df.index.isin(cf.index), "State"]=cf[1]
In [25]:
df.loc[df['State'].isna()]
Out[25]:
Title Location Date Incident Area Open/Close Location Target Cause Summary Fatalities Injured ... Age Employeed (Y/N) Employed at Mental Health Issues Race Gender Latitude Longitude City State
S#
65 Nightclub fight turns into shootout NaN 2/7/2016 nightclub Close random anger Two groups of people fired at each other after... 1 7 ... NaN NaN NaN Unknown Unknown Unknown 43.161030 -77.610922 NaN NaN
164 Washington Navy Yard Washington D.C. 9/16/2013 NaN Close random terrorism On September 16, 2013, a 34-yearl old contract... 13 3 ... 34 1.0 Navy Yard Yes Black American or African American Male 38.904809 -77.016297 Washington D.C. None

2 rows × 22 columns

In [26]:
#manualy filling the two null values
df.loc[65, 'State']='NY'
df.loc[164, 'State']='Washington D.C.'
In [27]:
#creating a state abbreviation dictionary from google 
states_abb= {
'AL': 'Alabama',
'KY': 'Kentucky',
'OH': 'Ohio',
'AK': 'Alaska',
'LA': 'Louisiana',
'OK': 'Oklahoma',
'AZ': 'Arizona',
'ME': 'Maine',
'OR': 'Oregon',
'AR': 'Arkansas',
'MD': 'Maryland',
'PA': 'Pennsylvania',
'AS': 'American Samoa',
'MA': 'Massachusetts',
'PR': 'Puerto Rico',
'CA': 'California',
'MI': 'Michigan',
'RI': 'Rhode Island',
'CO': 'Colorado',
'MN': 'Minnesota',
'SC': 'South Carolina',
'CT': 'Connecticut',
'MS': 'Mississippi',
'SD': 'South Dakota',
'DE': 'Delaware',
'MO': 'Missouri',
'TN': 'Tennessee',
'DC': 'District of Columbia',
'MT': 'Montana',
'TX': 'Texas',
'FL': 'Florida',
'NE': 'Nebraska',
'GA': 'Georgia',
'NV': 'Nevada',
'UT': 'Utah',
'GU': 'Guam',
'NH': 'New Hampshire',
'VT': 'Vermont',
'HI': 'Hawaii',
'NJ': 'New Jersey',
'VA': 'Virginia',
'ID': 'Idaho',
'NM': 'New Mexico',
'VI': 'Virgin Islands',
'IL': 'Illinois',
'NY': 'New York',
'WA': 'Washington',
'IN': 'Indiana',
'NC': 'North Carolina',
'WV': 'West Virginia',
'IA': 'Iowa',
'ND': 'North Dakota',
'WI': 'Wisconsin',
'KS': 'Kansas',
'MP': 'Northern Mariana Islands',
'WY': 'Wyoming',
'TT': 'Trust Territories'
}
In [28]:
#replacing the abbreviation by name
for i in states_abb.keys():
    df.loc[df['State'].str.strip()==i, 'State']=states_abb[i]
In [29]:
df.isna().sum()
Out[29]:
Title                     0
Location                 45
Date                      0
Incident Area            62
Open/Close Location      28
Target                    5
Cause                    77
Summary                   0
Fatalities                0
Injured                   0
Total victims             0
Policeman Killed          6
Age                     144
Employeed (Y/N)         256
Employed at             282
Mental Health Issues      0
Race                      2
Gender                    0
Latitude                  0
Longitude                 0
City                     45
State                     0
dtype: int64
In [30]:
#cross checking values
df['State']
Out[30]:
S#
1            Texas
2         Colorado
3         Maryland
4           Nevada
5       California
          ...     
319       Illinois
320      Louisiana
321     Washington
322        Arizona
323          Texas
Name: State, Length: 323, dtype: object
In [31]:
#crosschecking
df['State'].value_counts().index
Out[31]:
Index([' California', ' Florida', ' Texas', ' Georgia', ' Washington',
       ' Arizona', ' North Carolina', ' Ohio', ' New York', ' Alabama',
       ' Wisconsin', ' Illinois', 'California', ' Colorado', ' Tennessee',
       ' Michigan', ' Pennsylvania', ' Virginia', ' South Carolina',
       ' Kentucky', ' Oklahoma', ' Nevada', ' Mississippi', ' Minnesota',
       'Michigan', ' Louisiana', ' Massachusetts', ' Kansas', ' Oregon',
       'Texas', 'Florida', ' New Jersey', 'Louisiana', ' Missouri', 'Virginia',
       ' Connecticut', 'Illinois', ' Nebraska', 'Washington', ' Arkansas',
       ' Montana', 'Pennsylvania', 'Alabama', ' Utah', 'New Mexico', 'Georgia',
       'Missouri', ' New Mexico',
       ' Souderton, Lansdale, Harleysville, Pennsylvania',
       ' Lancaster, Pennsylvania', ' Vermont', ' Albuquerque, New Mexico',
       'Washington D.C.', ' San Diego, California', ' Hawaii', ' Alaska',
       ' Wyoming', 'Maryland', ' West Virginia', 'New Jersey', ' Idaho',
       'Nevada', ' Indiana', ' Maryland', ' South Dakota', ' Maine',
       'Delaware', 'Arizona', 'New York', 'Tennessee', 'Colorado', ' Texas ',
       '  Virginia', 'Kansas', ' Iowa'],
      dtype='object')
In [32]:
#it seems that state names have space infront of the name 
#stripping the space on both sides
df['State']=df['State'].str.strip()
In [33]:
#rechecking
df['State'].value_counts().index
Out[33]:
Index(['California', 'Florida', 'Texas', 'Georgia', 'Washington', 'Arizona',
       'Illinois', 'New York', 'North Carolina', 'Alabama', 'Michigan', 'Ohio',
       'Virginia', 'Wisconsin', 'Pennsylvania', 'Louisiana', 'Tennessee',
       'Colorado', 'Nevada', 'South Carolina', 'Oklahoma', 'Kansas',
       'Missouri', 'Kentucky', 'Massachusetts', 'Minnesota', 'Oregon',
       'New Mexico', 'New Jersey', 'Mississippi', 'Nebraska', 'Connecticut',
       'Montana', 'Maryland', 'Utah', 'Arkansas', 'Wyoming', 'Alaska',
       'Hawaii', 'San Diego, California', 'Vermont', 'Lancaster, Pennsylvania',
       'Delaware', 'Albuquerque, New Mexico', 'Washington D.C.',
       'West Virginia', 'Souderton, Lansdale, Harleysville, Pennsylvania',
       'Idaho', 'Indiana', 'South Dakota', 'Maine', 'Iowa'],
      dtype='object')

We still have some states with city name followed by comma and then the state name. We will try to remove that below.

In [34]:
#removing city name
df.loc[df['State'].str.contains(','), 'State']= df.loc[df['State'].str.contains(',')]['State'].str.split(' ').str[-1]
In [35]:
df['State'].value_counts().index
Out[35]:
Index(['California', 'Florida', 'Texas', 'Washington', 'Georgia', 'Arizona',
       'Illinois', 'North Carolina', 'Alabama', 'New York', 'Pennsylvania',
       'Ohio', 'Michigan', 'Virginia', 'Wisconsin', 'Colorado', 'Louisiana',
       'Tennessee', 'Nevada', 'Missouri', 'South Carolina', 'Kansas',
       'Oklahoma', 'Kentucky', 'Oregon', 'Minnesota', 'Mississippi',
       'Massachusetts', 'New Jersey', 'New Mexico', 'Nebraska', 'Connecticut',
       'Montana', 'Maryland', 'Utah', 'Arkansas', 'Wyoming', 'Alaska',
       'Hawaii', 'Vermont', 'West Virginia', 'Mexico', 'Washington D.C.',
       'Delaware', 'Idaho', 'Indiana', 'South Dakota', 'Maine', 'Iowa'],
      dtype='object')
In [36]:
#doing a value count
df['State'].value_counts()
Out[36]:
California         37
Florida            25
Texas              21
Washington         16
Georgia            16
Arizona            12
Illinois           12
North Carolina     11
Alabama            11
New York           11
Pennsylvania       10
Ohio               10
Michigan           10
Virginia            9
Wisconsin           9
Colorado            7
Louisiana           7
Tennessee           7
Nevada              6
Missouri            5
South Carolina      5
Kansas              5
Oklahoma            5
Kentucky            5
Oregon              4
Minnesota           4
Mississippi         4
Massachusetts       4
New Jersey          4
New Mexico          4
Nebraska            3
Connecticut         3
Montana             2
Maryland            2
Utah                2
Arkansas            2
Wyoming             1
Alaska              1
Hawaii              1
Vermont             1
West Virginia       1
Mexico              1
Washington D.C.     1
Delaware            1
Idaho               1
Indiana             1
South Dakota        1
Maine               1
Iowa                1
Name: State, dtype: int64
In [37]:
# we will also add another column for state abbreviation so that it can be used in making chrolopleth charts
def find_keys(x):
    for i in states_abb:
        if states_abb[i]==x:
            return i

df['State_abb']=df['State'].apply(lambda x : find_keys(x))
df['State_abb']
Out[37]:
S#
1      TX
2      CO
3      MD
4      NV
5      CA
       ..
319    IL
320    LA
321    WA
322    AZ
323    TX
Name: State_abb, Length: 323, dtype: object

Employed AT and Employeed (Y/N)¶

Since we saw that Employed At and Employeed (Y/N) columns had very large number of null values so we will be dropping them from the dataframe.

In [38]:
#dropping the columns in place
df.drop(columns=['Employed at', 'Employeed (Y/N)'], inplace=True)

Age column¶

In [39]:
#checking the summary column where age is null particularly row 18.
df[df['Age'].isna()][['Age', 'Summary']].loc[18,'Summary']
Out[39]:
"Shooter was angry over fact that rival families had been slashing each other's cars. Shooter fired into a block party and injured four."

It appears that summary has no detail about the age of the person shooting.

In [40]:
#checking value counts
df['Age'].value_counts()
Out[40]:
44       7
14       7
34       7
18       7
37       6
15       6
45       6
43       6
39       6
20       6
40       6
19       6
41       6
17       6
36       5
22       5
29       5
42       5
28       5
24       5
23       5
16       4
26       4
25       4
32       3
47       3
27       3
59       3
35       3
48       2
64       2
55       2
30       2
57       2
38       2
52       2
31       1
15,16    1
17,18    1
50       1
13,11    1
33       1
13       1
66       1
19,22    1
19,32    1
70       1
12       1
21       1
0        1
51       1
Name: Age, dtype: int64
In [41]:
#number of rows with missing values at Age column
df['Age'].isna().sum()
Out[41]:
144

In some rows we have two age numbers, assumming that the first number corresponds to the age of the suspect and second number corresponds to the age of the victim. We will remove the second number from this column.

In [42]:
df['Age']=df['Age'].str.split(',').str.get(0)

We will fill the missing age with '1000' so that we can convert all the numbers in string format to numbers i.e int.

In [43]:
df['Age']=df["Age"].fillna('1000')
In [44]:
#converting the age number from string to int
df["Age"]=df["Age"].astype('int')

For analysis purpose, we are defining the age groups of the people that fall in below buckets:

  • Less than 16 years - child
  • Between 17 to 30 - Young Adult
  • Between 31 to 45 - Middle Aged Adults
  • Between 46 to 65 - Old Aged
  • Above 65 - Elderly
  • Age with number 1000 - Unknown
In [45]:
def age_groups(age : int):
    if age==1000:
        group='Unknown'
    elif age <= 16:
        group= 'Child'
    elif (age >= 17) & (age < 31):
        group= 'Young Adults'
            
    elif (age >= 31) & (age < 46):
        group='Middle Aged Adults'
            
    elif (age >= 46) & (age < 66):
        group='Old Adults'
            
    else:
        group='Elderly'
            
    return group
In [46]:
df['Age Group']= df['Age'].apply(lambda x: age_groups(x))
In [47]:
df['Age Group'].value_counts()
Out[47]:
Unknown               144
Middle Aged Adults     70
Young Adults           67
Child                  22
Old Adults             18
Elderly                 2
Name: Age Group, dtype: int64
In [48]:
#rechecking missing value counts
df.isna().sum()
Out[48]:
Title                    0
Location                45
Date                     0
Incident Area           62
Open/Close Location     28
Target                   5
Cause                   77
Summary                  0
Fatalities               0
Injured                  0
Total victims            0
Policeman Killed         6
Age                      0
Mental Health Issues     0
Race                     2
Gender                   0
Latitude                 0
Longitude                0
City                    45
State                    0
State_abb                2
Age Group                0
dtype: int64

Race column¶

In [49]:
df['Race'].value_counts()
Out[49]:
White American or European American                    122
Black American or African American                      76
Unknown                                                 42
Some other race                                         20
white                                                   12
Asian American                                          11
White                                                    9
Asian                                                    6
Black                                                    5
Latino                                                   5
Native American or Alaska Native                         3
black                                                    3
Other                                                    2
Two or more races                                        2
Black American or African American/Unknown               1
White American or European American/Some other Race      1
Asian American/Some other race                           1
Name: Race, dtype: int64
In [50]:
df['Race']= df['Race'].fillna('Unknown')

There are some rows where Race has two values separarted by '/'. We can keep the first one and remove the second one.

In [51]:
df['Race']=df['Race'].str.split('/').str.get(0)
In [52]:
df.loc[df['Race']=='white', 'Race']='White'
df.loc[df['Race']=='White American or European American', 'Race']='White'
df.loc[df['Race']=='black', 'Race']='Black'
df.loc[df['Race']=='Black American or African American', 'Race']='Black'
df.loc[df['Race']=='Asian American', 'Race']='Asian'
df.loc[df['Race']=='Some other race', 'Race']='Other'
df.loc[df['Race']=='Two or more races', 'Race']='Other'
In [53]:
df['Race'].value_counts()
Out[53]:
White                               144
Black                                85
Unknown                              44
Other                                24
Asian                                18
Latino                                5
Native American or Alaska Native      3
Name: Race, dtype: int64

Policeman Killed¶

In [54]:
df[df['Policeman Killed'].isna()]['Summary']
Out[54]:
S#
6     Randy Stair, a 24-year-old worker at Weis groc...
7     John Robert Neumann, Jr., 45, a former employe...
9     Kori Ali Muhammad, 39, opened fire along a str...
10    Esteban Santiago, 26, flew from Alaska to Fort...
11    Arcan Cetin, 20, killed a teen girl and three ...
28    Three people were killed in a shooting at a ho...
Name: Summary, dtype: object

After reading the summary for the these columns with missing value for 'Policeman Killed', found that no police man was killed so filling the nan values with 0.

In [55]:
df['Policeman Killed']=df['Policeman Killed'].fillna(0)

Cause¶

In [56]:
df['Cause'].value_counts()
Out[56]:
psycho                  68
terrorism               65
anger                   44
frustration             18
domestic dispute        12
unemployement           10
revenge                 10
racism                   6
unknown                  4
failing exams            3
domestic disputer        1
suspension               1
religious radicalism     1
drunk                    1
breakup                  1
robbery                  1
Name: Cause, dtype: int64
In [57]:
#filling the NAN with unknown
df['Cause']=df['Cause'].fillna('unknown')

Mental Health Issues¶

In [58]:
df['Mental Health Issues'].value_counts()
Out[58]:
Unknown    110
Yes        106
No          93
Unclear     13
unknown      1
Name: Mental Health Issues, dtype: int64
In [59]:
df.loc[df['Mental Health Issues']=='unknown', 'Mental Health Issues']='Unknown'
In [60]:
#### Gender
In [61]:
df['Gender'].value_counts()
Out[61]:
Male           272
Unknown         21
M               20
Female           5
Male/Female      4
M/F              1
Name: Gender, dtype: int64
In [62]:
#mapping the correct names
df['Gender']=df['Gender'].map({'Female': 'Female',
                   "M": "Male",
                   'M/F': "Male/Female",
                   'Male/Female': 'Male/Female',
                   "Unknown": "Unknown",
                    "Male": "Male"})
In [63]:
df['Gender'].value_counts()
Out[63]:
Male           292
Unknown         21
Male/Female      5
Female           5
Name: Gender, dtype: int64
In [64]:
df['Gender']= df['Gender'].fillna('Unknown')

Target¶

In [65]:
#value counts on target
df['Target'].value_counts()
Out[65]:
random                       140
Family                        37
Coworkers                     17
Students                      16
Students+Teachers             14
Ex-Coworkers                   9
party guests                   8
Ex-Wife                        7
Policeman                      7
coworkers                      5
Teachers                       5
neighbors                      4
Ex-Wife & Family               3
Friends                        2
school girls                   2
Children                       2
police                         2
Ex-Girlfriend                  2
Sikhs                          1
black men                      1
Congresswoman                  1
Policeman+Council Member       1
postmaster                     1
lawyers                        1
basketball players             1
psychologist+psychiatrist      1
Ex-GirlFriend                  1
hunters                        1
Social Workers                 1
Family+students                1
Students+Parents               1
prayer group                   1
monks                          1
Family+random                  1
Coworker's Family              1
Girlfriend                     1
women                          1
uninvited guests               1
birthday party bus             1
Trooper                        1
club members                   1
Family/Neighbors               1
drug dealer                    1
protestors                     1
Marines                        1
Ex-girlfriend                  1
House Owner                    1
Contestant                     1
Ex-Girlfriend & Family         1
Ex-Girlfriend+random           1
rapper+random                  1
TSA Officer                    1
partner's family               1
welding shop employees         1
Name: Target, dtype: int64
In [66]:
df['Target']=df['Target'].fillna('Unknown')
In [67]:
#mapping the correct names to values
df['Target_group']= df['Target'].map({'random': 'Random',
                                    'Ex-girlfriend': 'Ex-girlfriend',
                                    'Ex-GirlFriend': 'Ex-girlfriend',
                                     'Ex-Girlfriend': 'Ex-girlfriend',
                                     'Ex-Girlfriend & Family': 'Ex-girlfriend',
                                     'Ex-Girlfriend+random': 'Ex-girlfriend',
                                    'Coworkers' : 'Coworkers',
                                    'coworkers': 'Coworkers',
                                    'Family': 'Family',
                                    'Students': 'Students',
                                    'party guests': 'Guests',
                                    'uninvited guests': 'Guests',
                                    'Students+Teachers': 'Students',
                                    'women': 'Random',
                                     'school girls': 'Students',
                                     'welding shop employees': 'Employees',
                                     'club members': 'Employees',
                                     'birthday party bus': 'Guests',
                                     'Ex-Coworkers': 'Ex-Coworkers',
                                     'Ex-Wife': 'Ex-Wife',
                                     'Policeman': 'Police',
                                     "partner's family" : "Partner's Family",
                                     "TSA Officer": "Government Officer",
                                     "Trooper" : "Trooper",
                                     "rapper+random" : "Rapper",
                                     "Contestant": "Contestant", 
                                     "House Owner": "House Owner",
                                     "Marines": "Marines",
                                     "protestors": "Protestors",
                                     "drug dealer": "Drug Dealer",
                                     "Family/Neighbors" : "Family",
                                     "police": "Police",
                                     "Family+random": "Family",
                                     "Policeman+Council Member": "Police",
                                     "monks": "Religious Leader",
                                     "black men": "Black men",
                                     "Sikhs": "Sikhs",
                                     "psychologist+psychiatrist": "Psychiatrist",
                                     "basketball players": "Basketball players",
                                     "Children": "Children",
                                     "Friends": "Friends",
                                     "Social Workers": "Social Workers",
                                     "Unknown": "Unknown",
                                     "neighbors": "Neighbors",
                                     "Ex-Wife & Family": "Ex-Wife",
                                     "Congresswoman": "Congresswoman",
                                     "hunters": "hunters",
                                     "Students+Parents": "Students",
                                     "postmaster": "Postmaster",
                                     "Teachers": "Teachers",
                                     "Coworker's Family": "Coworker",
                                     "Girlfriend": "Girlfriend",
                                     "lawyers": "Lawyers",
                                     "prayer group": "Prayer group",
                                     "Family+students": "Students"}
                                    )

Date¶

In [68]:
#converting Date to date time format and extracting the features
df['Date']=pd.to_datetime(df['Date'])
df['Year']=df['Date'].dt.year
df['Month']=df['Date'].dt.month
df['Weekday']=df['Date'].dt.day
In [69]:
print(f"We have data from {df['Year'].min()} till {df['Year'].max()}.")
We have data from 1966 till 2017.

Saving the cleaned data to new CSV file¶

In [70]:
#finally saving the cleaned data as separate csv file
df.to_csv('cleaned_out.csv', index=False)

Reading the cleaned data¶

In [71]:
#importing cleaned data
df= pd.read_csv('cleaned_out.csv')
In [72]:
#reading head
df.head(2)
Out[72]:
Title Location Date Incident Area Open/Close Location Target Cause Summary Fatalities Injured ... Latitude Longitude City State State_abb Age Group Target_group Year Month Weekday
0 Texas church mass shooting Sutherland Springs, TX 2017-11-05 Church Close random unknown Devin Patrick Kelley, 26, an ex-air force offi... 26 20 ... 29.273200 -98.056400 Sutherland Springs Texas TX Young Adults Random 2017 11 5
1 Walmart shooting in suburban Denver Thornton, CO 2017-11-01 Wal-Mart Open random unknown Scott Allen Ostrem, 47, walked into a Walmart ... 3 0 ... 39.903057 -104.954445 Thornton Colorado CO Old Adults Random 2017 11 1

2 rows × 26 columns

In [73]:
#checking info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Title                 323 non-null    object 
 1   Location              278 non-null    object 
 2   Date                  323 non-null    object 
 3   Incident Area         261 non-null    object 
 4   Open/Close Location   295 non-null    object 
 5   Target                323 non-null    object 
 6   Cause                 323 non-null    object 
 7   Summary               323 non-null    object 
 8   Fatalities            323 non-null    int64  
 9   Injured               323 non-null    int64  
 10  Total victims         323 non-null    int64  
 11  Policeman Killed      323 non-null    float64
 12  Age                   323 non-null    int64  
 13  Mental Health Issues  323 non-null    object 
 14  Race                  323 non-null    object 
 15  Gender                323 non-null    object 
 16  Latitude              323 non-null    float64
 17  Longitude             323 non-null    float64
 18  City                  278 non-null    object 
 19  State                 323 non-null    object 
 20  State_abb             321 non-null    object 
 21  Age Group             323 non-null    object 
 22  Target_group          323 non-null    object 
 23  Year                  323 non-null    int64  
 24  Month                 323 non-null    int64  
 25  Weekday               323 non-null    int64  
dtypes: float64(3), int64(7), object(16)
memory usage: 65.7+ KB
In [74]:
#converting Date column to DateTime format
df['Date']=pd.to_datetime(df['Date'])

Plotly Data Analysis Charts¶

1. Total Mass Shooting Incidences Per Year¶

In [75]:
df_plot= df.groupby('Year')[['Date']].count()
fig_1=px.bar(df_plot, x=df_plot.index, y='Date', labels={'Date': '# cases'})
fig_1.update_layout(title='US mass shooting incidences per year',
                 xaxis_title='Year', yaxis_title='Number of cases', xaxis = dict(
        tickmode = 'linear', dtick = 2))
fig_1.update_xaxes(tickangle=30)
fig_1.show()

After analysing 50 years of data for US Mass shooting across various states:

  • It can be seen that the number of cases are increasing exponentially since 1966.
  • Highest number of incidences were recorded in 2016 with 69 such cases.
  • It can be expected that these numbers are going to increase in coming years.

2. History of US Mass Shooting (in terms of Fatalities, Injured and Total Victims)¶

In [76]:
df_plot=df.groupby(['Year']).agg({'Date': 'count', 'Fatalities': 'sum', 'Injured': 'sum', 'Total victims': 'sum'}).reset_index().sort_values('Year', ascending=True)

trace1 = go.Scatter(x=df_plot['Year'],
                    y=df_plot['Fatalities'],
                    mode='lines+markers',
                    line=dict(width=1.5),
                   name='Fatalities')

trace2 = go.Scatter(x = df_plot['Year'],
                    y = df_plot['Injured'],
                    mode='lines+markers',
                    line=dict(width=1.5),
                   name='Injured')

trace3 = go.Scatter(x = df_plot['Year'],
                    y = df_plot['Total victims'],
                    mode='lines+markers',
                    line=dict(width=1.5),
                   name='Total victims')

frames = [dict(data= [dict(type='scatter',
                           x=df_plot['Year'][:k+1],
                           y=df_plot['Fatalities'][:k+1]),
                      dict(type='scatter',
                           x=df_plot['Year'][:k+1],
                           y=df_plot['Injured'][:k+1]),
                     dict(type='scatter',
                           x=df_plot['Year'][:k+1],
                           y=df_plot['Total victims'][:k+1])],
               traces= [0, 1, 2],  
              )for k  in  range(1, len(df_plot['Year'].unique().tolist()))] 

layout = go.Layout(#width=650,
                   #height=400,
                   title='History of US mass shooting in terms of fatalities/injured/total_victims',
                   xaxis_title='Year',
                   yaxis_title='Number of individuals affected',
                   showlegend=False,
                   hovermode='closest',
                   updatemenus=[dict(type='buttons', showactive=False,
                                y=1.05,
                                x=1.15,
                                xanchor='right',
                                yanchor='top',
                                pad=dict(t=0, r=10),
                                buttons=[dict(label='Play',
                                              method='animate',
                                              args=[None, 
                                                    dict(frame=dict(duration=100, 
                                                                    redraw=False),
                                                         transition=dict(duration=8),
                                                         fromcurrent=True,
                                                         mode='immediate')])])])


layout.update(xaxis =dict(range=[1965, 2020]),
              yaxis =dict(range=[0, 700]));
fig_2 = go.Figure(data=[trace1, trace2, trace3], frames=frames, layout=layout)
fig_2.update_layout(hovermode="x")
fig_2.show()

3. Shooter's Gender and affected people in terms of total victims¶

In [77]:
fig_3=px.scatter(df, y='Year', x='Month', size= 'Total victims', color='Gender', 
               hover_name="State", hover_data=["Fatalities", "Injured", "Policeman Killed"])
fig_3.update_layout(
        title="Shooter's Gender and affected people in terms of total victims <br> (Hover over the markers to see more details)",
        xaxis=dict(
        title='Month',
        tickmode='linear'),
        yaxis=dict(
        title='Year',
        tickmode='linear',
        dtick=2))

fig_3.show()
  • It can be clearly seen from the chart that Male are mostly behind these mass shooting cases. A male individual was responsible for impacting 585 people in a single mass shooling case of Nevada in year 2017.
  • There are only five cases where the shooter was a female.
  • There is no clear pattern as to which month had recorded more number of cases.

4. Heat map to see any monthly pattern¶

In [78]:
from plotly.subplots import make_subplots
fig_4 = make_subplots(rows=1, cols=3, shared_yaxes=True, subplot_titles=('Total victims', 'Injured', 'Fatalities'))

fig_4.add_trace(go.Heatmap(
        z=df['Total victims'],
        x=df['Month'],
        y=df['Year'],coloraxis='coloraxis', xaxis='x'
        ), row=1, col=1)
fig_4.add_trace(go.Heatmap(
        z=df['Fatalities'],
        x=df['Month'],
        y=df['Year'],coloraxis='coloraxis2' , xaxis='x1'
        ), row=1, col=3)
fig_4.add_trace(go.Heatmap(
        z=df['Injured'],
        x=df['Month'],
        y=df['Year'],coloraxis='coloraxis3' , xaxis='x2'
        ), row=1, col=2)
fig_4.update_layout(
    title='Total Victims per month in a year',
    )
fig_4.update_layout(xaxis=dict(
        title='Month',
        tickmode='linear'),
        yaxis=dict(
        title='Year',
        tickmode='linear',
        dtick=2))
fig_4.update_layout(xaxis2=dict(
        title='Month',
        tickmode='linear'),
        xaxis3=dict(
        title='Month',
        tickmode='linear')
        )
fig_4.update_layout(
                  coloraxis=dict(colorscale='spectral', colorbar_y=0.25, colorbar_thickness=23, colorbar_len= 0.30, colorbar_title= 'Total victims' ),
                  coloraxis2=dict(colorscale='YlOrRd', colorbar_y=0.50, colorbar_thickness=23, colorbar_len=0.30, colorbar_title= 'Fatalities'), 
                coloraxis3=dict(colorscale='oryel', colorbar_y=0.75, colorbar_thickness=23, colorbar_len=0.30, colorbar_title= 'Injured'))
fig_4.show()

In the recent years mass shooting incidences have become common in almost all the months.

The maximum number of victims reported on 2017 in the month of October and second in the line was reported in the month of June in year 2016.

5. Incidences per month¶

In [79]:
df_plot=df.groupby('Month')[['Title']].count()
fig_5=px.bar(df_plot, x=df_plot.index, y='Title', labels = {'Title': 'Incidences'})
fig_5.update_layout(
        title='US Mass Shooting - Total Incidences per Month',
        xaxis=dict(
        tickmode='linear', dtick=2))

fig_5.show()

It can be seen that February is the month with highest number of Mass shooting incidences recorded. Also, winter months in general have higher number of cases.

Seasons have a demonstrated affect on mood, hostility, anger, irritability, and anxiety. Additional detailed data and further analysis can be conducted to understand if there is any relationship with season and mass shooting.

6. Fatalities vs Injured¶

In [80]:
fig_6=px.scatter(df, x='Injured', y='Fatalities', hover_name="Title", hover_data=["Date", "State", "Target_group"], color='Total victims')
fig_6.update_layout(
        title='Fatalities Vs Injured')
fig_6.show()

Las Vegas Strip mass shooting case is the outlier here with Injured being ten times the number of fatalities.

7. Shooting Incidences With Mental Health Condition¶

In [81]:
df_plot=df['Mental Health Issues'].value_counts(normalize=True)*100
fig_7=px.pie(df_plot, values=df_plot.values, names=df_plot.index, labels={'index':'Mental Health Issue'})
fig_7.update_layout(title='Percentage of Shootings with mental health issues')
fig_7.show()

32.8% of the incidences involved shooter with mental health condition while 28.8% had no mental health issue. Gor 35% of the incidences, it is unknown if the shooter had any mental heath issue.

8. Shooter Race/Gender/Mental Health condition and victims involved in shooting¶

In [82]:
fig_8 = px.sunburst(df, path=['Race', 'Gender', 'Mental Health Issues'], values='Total victims', labels={'labels':'Mental Health Issues',
                                                                                                      'id': 'Race'})
fig_8.update_layout(title= 'Shooter Race/Gender/Mental Health condition and victims involved in shooting')
fig_8.show()

9. Statewise Fatalities¶

In [83]:
fig_9 = go.Figure()
fig_9.add_trace(go.Scattergeo(
        locationmode = 'USA-states',
#     locations=df['State'], 
    text=[f'State: {x}; Fatalities: {y}; Target_group : {z}' for x,y, z in list(zip(df['State'], df['Fatalities'],df['Target_group'] ))] ,
    lat= df['Latitude'],
    lon= df['Longitude'],
    marker = dict(
                size = df['Fatalities']),

    ))

fig_9.update_layout(
        title_text = 'US Mass shooting fatalities by states',
        showlegend = True,
        geo = dict(
            scope = 'usa'
        )
    )
fig_9.show()

Nevada, Florida, Texas, California, Virginia and Connecticut have large number of fatalities.

10. Animated plot to see history of mass shooting across US states.¶

In [84]:
df_1=df.sort_values('Year', ascending=True)

fig_10= px.choropleth(df_1, locations=df['State_abb'],locationmode="USA-states", color='Fatalities'
                   ,range_color=(min(df_1.Fatalities), max(df_1.Fatalities))
                   ,hover_name="State",
                   color_continuous_scale=px.colors.sequential.Plasma,
                   animation_frame="Year",
                           scope="usa"
                          )
fig_10.update_layout(title='State wise fatalities from 1966 to 2017 <br>(Please select the Play button for animation)')
fig_10.show()

It can be seen that in the recent years more states have mass shooting cases reported tjhan ever before.

11. Incidences per state¶

In [86]:
states_abb= {
'AL': 'Alabama',
'KY': 'Kentucky',
'OH': 'Ohio',
'AK': 'Alaska',
'LA': 'Louisiana',
'OK': 'Oklahoma',
'AZ': 'Arizona',
'ME': 'Maine',
'OR': 'Oregon',
'AR': 'Arkansas',
'MD': 'Maryland',
'PA': 'Pennsylvania',
'AS': 'American Samoa',
'MA': 'Massachusetts',
'PR': 'Puerto Rico',
'CA': 'California',
'MI': 'Michigan',
'RI': 'Rhode Island',
'CO': 'Colorado',
'MN': 'Minnesota',
'SC': 'South Carolina',
'CT': 'Connecticut',
'MS': 'Mississippi',
'SD': 'South Dakota',
'DE': 'Delaware',
'MO': 'Missouri',
'TN': 'Tennessee',
'DC': 'District of Columbia',
'MT': 'Montana',
'TX': 'Texas',
'FL': 'Florida',
'NE': 'Nebraska',
'GA': 'Georgia',
'NV': 'Nevada',
'UT': 'Utah',
'GU': 'Guam',
'NH': 'New Hampshire',
'VT': 'Vermont',
'HI': 'Hawaii',
'NJ': 'New Jersey',
'VA': 'Virginia',
'ID': 'Idaho',
'NM': 'New Mexico',
'VI': 'Virgin Islands',
'IL': 'Illinois',
'NY': 'New York',
'WA': 'Washington',
'IN': 'Indiana',
'NC': 'North Carolina',
'WV': 'West Virginia',
'IA': 'Iowa',
'ND': 'North Dakota',
'WI': 'Wisconsin',
'KS': 'Kansas',
'MP': 'Northern Mariana Islands',
'WY': 'Wyoming',
'TT': 'Trust Territories'
}
In [87]:
#creating a df for the dictionary
df2=pd.Series(states_abb).to_frame('Names').reset_index()
In [88]:
#saving the value count of df['State_abb'] in a df
df3=df['State_abb'].value_counts().to_frame('Incidences').reset_index()
In [89]:
#merging the two dataframe
df4=pd.merge(df2, df3, how='left', left_on='index', right_on='index')
df4
df4.fillna(0, inplace=True)
In [90]:
fig_11= px.choropleth(df4, 
                   locations='index',locationmode="USA-states", color='Incidences'
                   ,hover_name="Names",
                           scope="usa"
                          )
fig_11.add_scattergeo(
    locations=df4.loc[df4['Incidences']==0]['index'],
    locationmode="USA-states",
    mode='markers',
    text='No Incidences reported',
    marker=dict(size=12, symbol="diamond", line=dict(width=2, color="DarkSlateGrey"))
)
fig_11.update_layout(title='State wise fatalities from 1966 to 2017 <br>(States with diamond marker have no incidences of mass shooting)')
fig_11.show()
  • California has largest number of mass shooting cases.
  • North Dakota, Rhode Island, New Hampshire, and Washington DC have no mass shooting incidences.

12. Target group VS Cause behind mass shooting¶

In [91]:
df.loc[df['Cause']=='domestic disputer', 'Cause']='domestic dispute'
In [92]:
df6=pd.crosstab(df['Cause'], df['Target_group'], values=df['Total victims'], aggfunc='sum').fillna(0)
fig_12 = go.Figure(data=go.Heatmap( z=df6.loc[:, df6.columns != 'Random'],
                               y=df6.loc[:, df6.columns != 'Random'].index,
                               x=df6.loc[:, df6.columns != 'Random'].columns,
                               colorscale="Reds"))
fig_12.update_layout(title='Cause Vs Target Group <br> (Excluded the target_group Random as majority of cases involved a random target)')
fig_12.show()
  • Students are the primary target of the the terrorists.
  • Psychos mainly target the family members.
  • Coworkers, Ex-wife and Police men are the next most target section of the community.

13. Race vs Age Group¶

In [93]:
fig_13 = go.Figure(data=go.Heatmap( z=pd.crosstab(df['Race'], df['Age Group']),
                               y=pd.crosstab(df['Race'], df['Age Group']).index,
                               x=pd.crosstab(df['Race'], df['Age Group']).columns,
                               colorscale="Reds"))
# fig.update_layout(xticks=pd.crosstab(df['Race'], df['Age Group']).index)
fig_13.show()
  • White people between age 17 to 46 are involved in most of the shootings followed by Blacks and Other races.